Contents

1 Motivation and background

Researchers have been proposing cancer taxonomies using omic profiles generated in bulk-tissue, and thus, whereas classifications based on somatic mutations and copy-number alterations reflect the underlying patterns of genomic alterations in the tumor cell-of-origin, classifications derived from bulk transcriptomic and/or epigenetic data are inevitably confounded by cell-type heterogeneity (CTH). Moreover, CTH means that it is much harder to pinpoint whether specific cancer-associated transcriptomic or epigenetic changes are happening in the tumor cell-of-origin and not in tumor stroma. Therefore, it is critical to explore novel classifications of tumor-types in terms of the cell-type specific transcriptomic and epigenetic changes for improved understanding of how distinct cancer subtypes emerge in relation to the functional changes that happen in the tumor cell of origin and in the various types of tumor-associated stromal cells.

Here we propose a novel strategy, based on the concept of “cell-type specific clustering” (CELTYC), to refine the molecular classification of cancer-types. The key innovative idea behind this proposal is that we can identify the features (CpGs/genes) driving disease-relevant cell-type specific variation, so that clustering over these features results in disease subtypes that are not confounded by variations in cell-type composition. We evaluate the above CELTYC strategy in the context of DNA methylation (DNAm) data. By applying our novel strategy to liver hepatocellular, we reveal novel biologically and clinically relevant tumor classifications.

Basically, CELTYC workflow includes:

  1. Estimating cell-type fractions for bulk DNAm data;
  2. Identifying disease/cancer-assicated DMCs altered in specific cell types (i.e. DMCTs);
  3. Generating standardized residual matrix from original DNAm matrix, and doing clustering on the standardized residual matrix over DMCTs (could be DMCTs unique to one cell type or DMCTs shared by different cell types)

2 Tutorial Example

In order to run the tutorial we must first load the necessary package and data. We have saved all the necessary data objects in a list named LIHC_data.

2.1 Loading in the data

knitr::opts_chunk$set(crop = NULL)
library(CELTYC)
data("LIHC_data") # load the data used in the following analysis

2.2 Estimating cell-type fractions for the LIHC samples

The first step of CELTYC is to estimate cell-type fractions from bulk DNA methylation (DNAm) data. Here we use DNAm data bmiq.m, a matrix with rows of CpGs and columns of samples, for Liver Hepatocellular Carcinoma (LIHC) samples (cancer and normal) for demonstration. To estimate cell-type fractions, for solid tissue, we use an R package EpiSCORE. Because the full LIHC dataset bmiq.m is too large for inclusion here, we just display the syntax:

library(EpiSCORE)
avDNAm.m <- constAvBetaTSS(bmiq.m,type="450k") #computing the average DNAm over a window 200bp upstream of TSS, or if not available over 1st exon probes.
estF.o <- wRPC(avDNAm.m,ref=mrefLiver.m,useW=TRUE,wth=0.4,maxit=500) # Estimating cell-type fraction with a DNAm reference matrix for liver via weighted robust linear regression
estF.m <- estF.o$estF

2.3 Identifying cell-type specific differentially methylated CpGs (DMCTs)

Next, we want to identify DMCTs employing the previously published CellDMC algorithm (EpiDISH::CellDMC()). Running EpiDISH::CellDMC requires the input of abovementioned matrix bmiq.m, and the use of the phenotype information (pheno.df, a data.frame with rows matched for bmiq.m), and again here we only display the syntax:

library(EpiDISH)
phe.v <- pheno.df$cancer # the cancer status for samples
phe.v[which(phe.v=="Cancer")] <- 1
phe.v[which(phe.v=="Normal")] <- 0
phe.v <- as.numeric(phe.v)
sex.v <- pheno.df$gender
sex.v[which(sex.v=="MALE")] <- 1
sex.v[which(sex.v=="FEMALE")] <- 2
sex.v <- as.numeric(sex.v)
age.v <- pheno.df$age_at_initial_pathologic_diagnosis
na.idx <- which(is.na(age.v)|is.na(sex.v)) # the running of CellDMC does not allow NA in input

cov.mod.use <- model.matrix(~sex.v[-na.idx]+age.v[-na.idx]) # exclude the confounding effect of age and sex
celldmc.o <- CellDMC(beta.m = bmiq.m[,-na.idx],pheno.v=phe.v[-na.idx],frac.m  = estF.m[-na.idx,],
                     mc.cores = 40,cov.mod = cov.mod.use)
dmct.lv <- list() # save DMCTs for each cell type
for(ct in colnames(celldmc.o[["dmct"]])[2:ncol(celldmc.o[["dmct"]])]){
    dmct.lv[[ct]] <-  rownames(celldmc.o[["dmct"]])[which(celldmc.o[["dmct"]][,ct]!=0)]
}
dmct.res <- list(celldmc_res=celldmc.o,dmct=dmct.lv)

dmct.res contains 2 list objects. The 1st list is the returned result of EpiDISH::CellDMC(), and the 2nd one is a list of identified DMCTs for each cell type. We have saved the 2nd list dmct.res$dmct as LIHC_data$allDMCT. To visualize the number of DMCTs for each cell-type and their overlaps, we can run:

library(SuperExactTest)
supertest.o <- supertest(LIHC_data$allDMCT,n=403197)
plot(supertest.o,"landscape",sort.by = "size",color.scale.cex = 0.7,overlap.size.cex = 0.6,cex=0.7)

2.4 Performing cell-type specific clustering

Now that we have identified cell-type specific methylation CpG sites associated with cancer status, we can perform cell-type specific clustering on DNAm data over different sets of DMCTs for cancer-status samples, after we regress out estimated CTF. Here we can use the prepared DNAm matrix over a subset of CPG probes LIHC_data$DNAm and the cell-type fraction matrix LIHC_data$estF with matched samples to LIHC_data$DNAm for further analysis. We now try to do clustering restricting to 3 sets of DMCTs (stored in a list object LIHC_data$selDMCT): DMCTs exclusively altered with disease status in lymphocytes, hepatocytes and endothelial cells respectively.

res.m <- GenResidualMat(LIHC_data$DNAm,estCTF.m = LIHC_data$estF,standardize = T,ncores = 40) # regress out CTFs from DNAm matrix and get standardized residual matrix
CELTYC.results.l <- DoCELTYC(res.m,method = "consensus",maxK = 3,dmct.lv = LIHC_data$selDMCT[c("Lym","Hep","EC")],title = "cluster-test") # do consensus clustering on standardized residual matrix over DMCTs specific to lymphocytes, hepatocytes and endothelial cells
## Start doing consensus clustering on standardized residual matrix over input DMC list.
## end fraction
## clustered
## clustered
## end fraction
## clustered
## clustered
## end fraction
## clustered
## clustered

We can check the sample composition of clustering results using DMCTs for different cell types, and generate a heatmap of scaled residual matrix labeled with cluster index. We can look at the pre-generated heatmap clustHeatmap here:

library(ComplexHeatmap)
lym.clust.v <- CELTYC.results.l$Lym[[3]]$consensusClass
hep.clust.v <- CELTYC.results.l$Hep[[3]]$consensusClass
ec.clust.v <- CELTYC.results.l$EC[[2]]$consensusClass # if K=3, the 3rd cluster contains few samples
print("Compare the lym-clusters and hep-clusters:")
## [1] "Compare the lym-clusters and hep-clusters:"
table(lym.clust.v,hep.clust.v)
##            hep.clust.v
## lym.clust.v  1  2  3
##           1 82 50 21
##           2 57 13 19
##           3 75 52 10
print("Compare the lym-clusters and EC-clusters:")
## [1] "Compare the lym-clusters and EC-clusters:"
table(lym.clust.v,ec.clust.v)
##            ec.clust.v
## lym.clust.v   1   2
##           1 145   8
##           2  87   2
##           3  19 118
print("Compare the hep-clusters and EC-clusters:")
## [1] "Compare the hep-clusters and EC-clusters:"
table(hep.clust.v,ec.clust.v)
##            ec.clust.v
## hep.clust.v   1   2
##           1 142  72
##           2  68  47
##           3  41   9
ComplexHeatmap::draw(LIHC_data$heatmap)

2.5 Performing survival analysis for CELTYC clusters

Next we want to see whether the newly obtained clusters using different sets of DMCTs are significantly associated with overall survival. We use the prepared data.frame object LIHC_data$pheno storing the phenotype information including O.S. time and death event:

library(survival)

# for clustering results using DMCTs for different cell types:
clust.l <- list(Lym=lym.clust.v,Hep=hep.clust.v,EC=ec.clust.v)
surv.res.l <- list()
for(m in 1:3){
  clust.v <- clust.l[[m]]
  chi.surv.p.v <- vector()
  tmp.v <- vector()
  count <- length(unique(na.omit(clust.v)))
  cl <- sort(unique(clust.v))

  # generate pairwise P values comparing survival probability between different clusters
  for(i in 1:(count-1)){
    for(j in (i+1):count){
      tmp.time <- LIHC_data$pheno$OS.time[clust.v %in% c(cl[i],cl[j])]
      tmp.event <- LIHC_data$pheno$event[clust.v %in% c(cl[i],cl[j])]
      tmp.clust.v <- clust.v[clust.v %in% c(cl[i],cl[j])]
      surv.o <- Surv(time = tmp.time,event = tmp.event)
      cox.o <- coxph(surv.o~tmp.clust.v)
      chisq.pval <- pchisq(cox.o$score,1,lower.tail = F)
      chi.surv.p.v <- c(chi.surv.p.v,chisq.pval)
      tmp.v <- c(tmp.v,paste0("cl ",cl[i],",",cl[j]))
    }
  }  
  names(chi.surv.p.v) <- tmp.v
  # generate overall KM-curve plot to see the difference of survival rate between clusters
  surv.o <- Surv(time = LIHC_data$pheno$OS.time,event = LIHC_data$pheno$event)
  survfit.o <- survfit(surv.o ~ clust.v) 
  surv.res.l[[m]] <- list()
  surv.res.l[[m]][["pair-Pval"]] <- chi.surv.p.v
  surv.res.l[[m]][["survfit"]] <- survfit.o
}

par(mar=c(2,2,2,2))
# for clustering results using lym DMCTs:
plot(surv.res.l[[1]]$survfit, col = c("firebrick","dodgerblue","orange"),lwd=2, mark.time=TRUE, xlab="Years", ylab="OS",xscale = 365.25,cex.lab=0.6,cex.axis=0.6,mgp = c(1, 0.5, 0),tck = -0.03,main="Lym clusters",cex.main=0.7) 
text(x = 1400,y=0.1,label=paste0(names(surv.res.l[[1]]$`pair-Pval`)[1]," Chisq P=",signif(surv.res.l[[1]]$`pair-Pval`[1],2)),cex = 0.6)
text(x = 1400,y=0.15,label=paste0(names(surv.res.l[[1]]$`pair-Pval`)[2]," Chisq P=",signif(surv.res.l[[1]]$`pair-Pval`[2],2)),cex = 0.6)
text(x = 1400,y=0.2,label=paste0(names(surv.res.l[[1]]$`pair-Pval`)[3]," Chisq P=",signif(surv.res.l[[1]]$`pair-Pval`[3],2)),cex = 0.6)

# for clustering results using Hep DMCTs:
plot(surv.res.l[[2]]$survfit, col = c("#FF8080","#80FF80","#8080FF"),lwd=2, mark.time=TRUE, xlab="Years", ylab="OS",xscale = 365.25,cex.lab=0.6,cex.axis=0.6,mgp = c(1, 0.5, 0),tck = -0.03,main="Hep clusters",cex.main=0.7) 
text(x = 1400,y=0.1,label=paste0(names(surv.res.l[[1]]$`pair-Pval`)[1]," Chisq P=",signif(surv.res.l[[2]]$`pair-Pval`[1],2)),cex = 0.6)
text(x = 1400,y=0.15,label=paste0(names(surv.res.l[[1]]$`pair-Pval`)[2]," Chisq P=",signif(surv.res.l[[2]]$`pair-Pval`[2],2)),cex = 0.6)
text(x = 1400,y=0.2,label=paste0(names(surv.res.l[[1]]$`pair-Pval`)[3]," Chisq P=",signif(surv.res.l[[2]]$`pair-Pval`[3],2)),cex = 0.6)

# for clustering results using EC DMCTs:
plot(surv.res.l[[3]]$survfit, col =  c("#F06000","#408030"),lwd=2, mark.time=TRUE, xlab="Years", ylab="OS",xscale = 365.25,cex.lab=0.6,cex.axis=0.6,mgp = c(1, 0.5, 0),tck = -0.03,main="EC clusters",cex.main=0.7) 
text(x = 1400,y=0.1,label=paste0(names(surv.res.l[[3]]$`pair-Pval`)[1]," Chisq P=",signif(surv.res.l[[2]]$`pair-Pval`[1],2)),cex = 0.6)

The results we obtain show that CELTYC enables us to divide LIHC samples into different clusters with clear segregation of survival rate when we use DMCTs specific to lymphocytes. However, no distinguishable segregation was observed when employing DMCTs specific to the other two cell types.

2.6 Performing cell-type specific clustering with an alternative method ("jive")

In the example above we do consensus clustering directly on standardized residual matrix over DMCTs specific to lymphocytes, hepatocytes and endothelial cells, by setting the parameter "method" to be "consensus" in function DoCELTYC(), but meanwhile we can also implement a statistical procedure called JIVE (Joint and Individual Variation Explained) that extracts out components of joint variation across all data matrices as well as components of individual variation that are unique to each cell-type or unique to specific combinations of cell-types, utilizing an R-package r.jive. To conduct jive analysis, one needs to prepare separate matrices with non-overlapping features, and in this case we can use the prepared list LIHC_data$selDMCT which stores the DMCTs for lymphocytes, hepatocytes and endothelials that are not shared by other two cell types, and also DMCTs shared by all three cell types. To save the running time, we only display the syntax for this method here:

res.m <- GenResidualMat(LIHC_data$DNAm,estCTF.m = LIHC_data$estF,standardize = T,ncores = 40)
jive.results.l <- DoCELTYC(res.m,method = "jive",maxK = 3,dmct.lv = LIHC_data$selDMCT,title = "cluster-test")
jive.clust.v <- jive.results.l$Lym[[2]]$consensusClass # extract the clustering results on JIVE-derived individual matrix for lymphocyte-specific DMCTs, when cluster number=3

3 Session Info

sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS:   /usr/local/lib64/R/4.4.4/lib64/R/lib/libRblas.so 
## LAPACK: /usr/local/lib64/R/4.4.4/lib64/R/lib/libRlapack.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.utf-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.utf-8        LC_COLLATE=en_US.utf-8    
##  [5] LC_MONETARY=en_US.utf-8    LC_MESSAGES=en_US.utf-8   
##  [7] LC_PAPER=en_US.utf-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.utf-8 LC_IDENTIFICATION=C       
## 
## time zone: Asia/Shanghai
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] survival_3.6-4        ComplexHeatmap_2.19.0 SuperExactTest_1.1.0 
## [4] CELTYC_0.1.0          BiocStyle_2.31.0     
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9                  gplots_3.1.3.1             
##  [3] ConsensusClusterPlus_1.58.0 bitops_1.0-7               
##  [5] KernSmooth_2.23-22          shape_1.4.6.1              
##  [7] gtools_3.9.5                lattice_0.22-6             
##  [9] magrittr_2.0.3              digest_0.6.35              
## [11] caTools_1.18.2              evaluate_0.23              
## [13] RColorBrewer_1.1-3          bookdown_0.39              
## [15] iterators_1.0.14            circlize_0.4.16            
## [17] fastmap_1.1.1               Matrix_1.7-0               
## [19] foreach_1.5.2               doParallel_1.0.17          
## [21] jsonlite_1.8.8              GlobalOptions_0.1.2        
## [23] BiocManager_1.30.22         codetools_0.2-20           
## [25] jquerylib_0.1.4             abind_1.4-5                
## [27] cli_3.6.2                   rlang_1.1.3                
## [29] crayon_1.5.2                Biobase_2.63.1             
## [31] splines_4.4.0               cachem_1.0.8               
## [33] yaml_2.3.8                  tools_4.4.0                
## [35] parallel_4.4.0              colorspace_2.1-0           
## [37] BiocGenerics_0.49.1         GetoptLong_1.0.5           
## [39] R6_2.5.1                    png_0.1-8                  
## [41] magick_2.8.3                matrixStats_1.3.0          
## [43] stats4_4.4.0                lifecycle_1.0.4            
## [45] S4Vectors_0.41.7            IRanges_2.37.1             
## [47] clue_0.3-65                 cluster_2.1.6              
## [49] bslib_0.7.0                 Rcpp_1.0.12                
## [51] highr_0.10                  xfun_0.43                  
## [53] r.jive_2.4                  knitr_1.46                 
## [55] rjson_0.2.21                htmltools_0.5.8.1          
## [57] rmarkdown_2.26              Cairo_1.6-2                
## [59] compiler_4.4.0